Of all the great brain teaser and logic games out there, for my money nothing can beat debugging kernel mode software. None of them are as complex, vast, and downright satisfying as solving a complex interaction between complex software running on a complex platform. And if that wasn't good enough, debugging is incredibly self reinforcing. The more you do it the better you get and yet there always seems to be a more complex problem lurking around the corner ready to push the limits of your current skill set.
Hmm, maybe I'm on to something. Maybe someday the Sudoku section in the newspaper will be replaced by a kernel debugging section. I can see it now, my Mom sitting in a chair on Sunday morning with her reading glasses on wondering if the address was pageable or completely invalid. Shigeru Miyamoto will take to the game and soon people across the world will be playing Kernel Debug Heroes on the Nintendo Wii.
Back to Reality
Until that day comes, however, we're left to the small minority, stuck having only each other to entertain with war stories of particularly difficult problems we've debugged. One thing I find particularly fascinating in speaking with others about debugging problems is the process they used to come to the solution. What I find is that people who debug often have a core set of tools and commands they use to get their job done and everyone's is slightly different. Commands they use all of the time I've never heard of, elegant approaches they use to a particular class of problem make mine seem barbaric, and vice-versa.
For a long while I've wanted to put together an article on a random collection of approaches we use to solve particular problems for two reasons. One, I want to see if someone else out there can offer alternate, time saving approaches, and two I'd hope that it would inspire others to open up their chests and share what their unique methods are.
Thus, the following is a mixed bag of techniques we find ourselves going to constantly while tracking down issues.
Application X No Longer Works Properly
We come across these a fair amount, especially in the file system and file system filter space. A filter is installed and application "X" starts to misbehave. These bugs are hard, especially when the application in question is complex and you don't have source for it (Office applications are a particular pain point here).
The main question on these is where to start? Typically you either get a silent failure or a not very helpful error message such as, "Error Saving File". Failures of this type generally fall into one of two categories:
1) API call fails unexpectedly
2) API call succeeds but returns unexpected result
By far the second type of failure is much more difficult to track down, typically requiring quite a bit of time analyzing the application, trace logs, and information from utilities such as FileSpy or IrpTracker. While the first type can also be tracked down this way, there is a trick you can go to in an attempt to quickly track the problem.
The trick is based on this assumption: most applications written for Windows use the Win32 API either directly or indirectly. As most of us know, the Win32 API is built upon the underlying NT native API, which provides a superset of features actually exploited by the Win32 API.
Because I'm trying to confirm or deny that a failing API call is to blame, what I want to know is, "Which API calls are failing during the failure case?" This basically means finding a single point in the system that is always called as a result of an API failure. While there's nothing like this documented, I do know that all Win32 APIs that use native APIs must translate the native NTSTATUS value returned from the O/S into a Win32 API return value. And, as luck would have it, they all call the same API: RtlNtStatusToDosError. This used to be undocumented, but a Google search shows that MSDN has provided the API:
ULONG
RtlNtStatusToDosError(
IN NTSTATUS Status
);
What I want to do then is set a breakpoint on RtlNtStatusToDosError in the faulty application. When the breakpoint fires, I want to see the failure status and the call stack of the failing call. If all goes well, I'll see some obvious API failing and that will lead the rest of my analysis.
For fun, let's see an example. We'll trace trying to save a read only file in Notepad running on our target machine. We'll put a breakpoint in the status conversion routine that prints out the first parameter, the callstack, and then resumes. Note that this is an x86 target.
0: kd> !process 0 0
**** NT ACTIVE PROCESS DUMP ****
--
PROCESS 819645d0 SessionId: 0 Cid: 0250 Peb: 7ffdf000 ParentCid: 01ac
DirBase: 02080200 ObjectTable: e10f3200 HandleCount: 90.
Image: notepad.exe
0: kd> .process /r /p 819645d0
Implicit process is now 819645d0
.cache forcedecodeuser done
Loading User Symbols
................................
0: kd> bp ntdll!rtlntstatustodoserror "dd @esp+4 l1; k; g"
0: kd> g
When I now try to save the read only file I see the following in the debugger:
0007fa38 c0000022
ChildEBP RetAddr
0007fa30 7c80937a ntdll!RtlNtStatusToDosError
0007fa40 7c81109a kernel32!BaseSetLastNTError+0xf
0007fab0 01004ede kernel32!CreateFileW+0x390
0007faec 01002c2f NOTEPAD!SaveFile+0x30
0007fd6c 01003927 NOTEPAD!NPCommand+0xa8
0007fd90 77d48734 NOTEPAD!NPWndProc+0x4fe
0007fdbc 77d48816 USER32!InternalCallWinProc+0x28
0007fe24 77d4b4c0 USER32!UserCallWinProcCheckWow+0x150
0007fe78 77d4b50c USER32!DispatchClientMessage+0xa3
0007fea0 7c90eae3 USER32!__fnDWORD+0x24
0007fea0 80500a60 ntdll!KiUserCallbackDispatcher+0x13
f7ee3b04 805a0771 nt!KiCallUserMode+0x4
f7ee3b60 bf813e47 nt!KeUserModeCallback+0x87
f7ee3be4 bf813fdd win32k!SfnDWORD+0xa8
f7ee3c2c bf8141cf win32k!xxxSendMessageToClient+0x176
f7ee3c78 bf80f5b8 win32k!xxxSendMessageTimeout+0x1a6
f7ee3c9c bf8e14ac win32k!xxxSendMessage+0x1b
f7ee3ce4 bf8e12a5 win32k!xxxTranslateAccelerator+0x264
f7ee3d50 8054060c win32k!NtUserTranslateAccelerator+0x85
f7ee3d50 7c90eb94 nt!KiFastCallEntry+0xfc
While this doesn't immediately provide the solution, it does give a starting point for the investigation.
Breaking on a Particular API Call From a Particular Driver
I often find myself in need of setting a breakpoint on a particular API but only when it's called by a particular driver.
For example, say I want to inspect every time the floppy disk driver completes an IRP with IoCompleteRequest. If I try to set a breakpoint in IoCompleteRequest I'm never going to be able to weed through all of the IRPs completing that I don't care about, I really just want the IRPs that the floppy disk driver completes.
Luckily, there's a feature of the Windows PE format that we can exploit: the import address table. Modules that call exports of other modules have something called an IAT which is fixed up by the loader to contain pointers to the locations of the routines in memory. A call "through" the IAT shows up as a pointer dereference in the debugger. For example:
call dword ptr [flpydisk!_imp_IofCompleteRequest (f9cacbf8)]
This instruction dereferences a value that lies somewhere within the floppy disk module that the loader fills in with the location of the real IoCompleteRequest:
1: kd> dps f9cacbf8 l1
f9cacbf8 804eef48 nt!IofCompleteRequest
We can then exploit this to our advantage. If we care to see all of the calls that the floppy disk driver makes to IoCompleteRequest we can set an access breakpoint on its IAT entry that points to the real IoCompleteRequest.
To find the value, we can use the X command to look for the symbol that represents the import:
1: kd> x Flpydisk!*iofcompleterequest
f9cacbf8 flpydisk!_imp_IofCompleteRequest = <no type information>
Then use the break on access command to stop on the pointer deref (pointers are 4 bytes on the x86, hence the "r4").
1: kd> ba r4 f9cacbf8
1: kd> g
Breakpoint 0 hit
nt!IofCompleteRequest:
804eef48 ff2584475580 jmp dword ptr [nt!pIofCompleteRequest (80554784)]
1: kd> k
ChildEBP RetAddr
f7fefd68 f9cade97 nt!IofCompleteRequest
f7fefd7c f9caf2f8 flpydisk!FlFinishOperation+0xab
f7fefdac 805ce794 flpydisk!FloppyThread+0x24a
f7fefddc 805450ce nt!PspSystemThreadStartup+0x34
00000000 00000000 nt!KiThreadStartup+0x16
Breaking on a Particular API Call From a Particular Driver When You Don't Have Symbols
The above was easy enough because we had symbolic information for the floppy disk driver, thus it was simple to find the address of the IoCompleteRequest import using the X command.
Let's try to do this again with a driver that we don't have symbols for, such as vmscsi.sys:
1: kd> x vmscsi!*
*** ERROR: Module load completed but symbols could not be loaded for vmscsi.sys
Oops, so much for that. So, how can we achieve the same result? Quite easily, actually, all we need to do is find the IAT of the module and figure out the module address for ourselves. First thing we'll need is the base address of the vmscsi module. We can do this with the Loaded Modules command and a match string:
1: kd> lm mvmscsi
start end module name
f9db8000 f9dbab00 vmscsi (no symbols)
Once we have the base address, we can use the Display Header command to inspect the PE header.
1: kd> !dh f9db8000
File Type: EXECUTABLE IMAGE
FILE HEADER VALUES
14C machine (i386)
6 number of sections
461D2114 time date stamp Wed Apr 11 13:55:32 2007
0 file pointer to symbol table
0 number of symbols
E0 size of optional header
102 characteristics
Executable
32 bit word machine
OPTIONAL HEADER VALUES
10B magic #
8.00 linker version
1A00 size of code
C80 size of initialized data
0 size of uninitialized data
1ADC address of entry point
480 base of code
----- new -----
801c0000 image base
80 section alignment
80 file alignment
1 subsystem (Native)
6.00 operating system version
6.00 image version
5.00 subsystem version
2B00 size of image
480 size of headers
12140 checksum
00040000 size of stack reserve
00001000 size of stack commit
00100000 size of heap reserve
00001000 size of heap commit
0 [ 0] address [size] of Export Directory
1E80 [ 3C] address [size] of Import Directory
2100 [ 828] address [size] of Resource Directory
0 [ 0] address [size] of Exception Directory
2B00 [ 1B30] address [size] of Security Directory
2980 [ A8] address [size] of Base Relocation Directory
1C50 [ 1C] address [size] of Debug Directory
0 [ 0] address [size] of Description Directory
0 [ 0] address [size] of Special Directory
0 [ 0] address [size] of Thread Storage Directory
0 [ 0] address [size] of Load Configuration Directory
0 [ 0] address [size] of Bound Import Directory
1C00 [ 4C] address [size] of Import Address Table Directory
0 [ 0] address [size] of Delay Import Directory
0 [ 0] address [size] of COR20 Header Directory
0 [ 0] address [size] of Reserved Directory
...
Note this line in the output:
1C00 [ 4C] address [size] of Import Address Table Directory
This gives us the offset and size of the IAT. Because the IAT is simply an array of pointers filled in by the loader, we can inspect each pointer and what it points to by dumping them out with the dps command.
1: kd> dps f9db8000+1c00 l4c/4
f9db9c00 806eaa44 hal!READ_PORT_UCHAR
f9db9c04 806eaaac hal!WRITE_PORT_UCHAR
f9db9c08 f97ec0d4 SCSIPORT!ScsiPortStallExecution
f9db9c0c f97ebfe4 SCSIPORT!ScsiPortGetPhysicalAddress
f9db9c10 f97ebf4c SCSIPORT!ScsiPortMoveMemory
f9db9c14 f97ec46a SCSIPORT!ScsiPortCompleteRequest
f9db9c18 f97ec0e6 SCSIPORT!ScsiPortNotification
f9db9c1c f97ebf9e SCSIPORT!ScsiPortGetSrb
f9db9c20 f97ec068 SCSIPORT!ScsiPortGetVirtualAddress
f9db9c24 f97ec0ae SCSIPORT!ScsiPortConvertUlongToPhysicalAddress
f9db9c28 f97ebecc SCSIPORT!ScsiPortLogError
f9db9c2c f97e8508 SCSIPORT!ScsiPortGetUncachedExtension
f9db9c30 f97e7afa SCSIPORT!ScsiPortGetDeviceBase
f9db9c34 f97ec0a4 SCSIPORT!ScsiPortValidateRange
f9db9c38 f97f2e0e SCSIPORT!ScsiPortInitialize
f9db9c3c 00000000
f9db9c40 80539bc0 nt!memset
f9db9c44 80539540 nt!memcpy
f9db9c48 00000000
The addresses on the left are now our "break on access" values and we can use them as we did in the previous example.
0: kd> ba r4 f9db9c18
0: kd> g
Breakpoint 0 hit
SCSIPORT!ScsiPortNotification:
f97ec0e6 8bff mov edi,edi
0: kd> k
ChildEBP RetAddr
f9e078c8 f9db98ab SCSIPORT!ScsiPortNotification
WARNING: Stack unwind information not available. Following frames may be wrong.
f9e078e8 f9db99d9 vmscsi+0x18ab
f9e078fc f97e9dd2 vmscsi+0x19d9
Breaking on a Particular Pool Tag
This is a pretty well known trick at this point but I'm always surprised that people haven't picked up on it since I find it incredibly useful. Say you're having a memory leak on your system and there is a particular pool tag implicated from the !poolused output. The tag doesn't belong to your driver and you'd like to know where it's coming from.
The first option you might think of would be a conditional breakpoint on ExAllocatePoolWithTag. This would work just fine, but as we know, conditional breakpoints can be excruciatingly slow as the debugger needs to break, evaluate the condition, then resume if it doesn't match.
Turns out there's a better way, the "PoolHitTag" global. Simply fill it in with the tag of your choosing and the O/S will break into the debugger whenever that tag is allocated. Let's set a breakpoint in ExAllocatePoolWithTag, see what the tag used is, then see if we can catch it when we hit GO.
0: kd> bp nt!exallocatepoolwithtag ".formats poi(@esp+c); k L4"
0: kd> g
Evaluate expression:
Hex: 63416553
Decimal: 1665230163
Octal: 14320262523
Binary: 01100011 01000001 01100101 01010011
Chars: cAeS
Time: Sat Oct 08 07:56:03 2022
Float: low 3.56752e+021 high 0
Double: 8.22733e-315
ChildEBP RetAddr
f9dc339c 805e3e14 nt!ExAllocatePoolWithTag
f9dc33c4 805e4223 nt!RtlpInheritAcl+0x28
f9dc34a0 805ef63b nt!RtlpNewSecurityObject+0x387
f9dc34cc 805be847 nt!SeAssignSecurity+0x4f
nt!ExAllocatePoolWithTag:
8054a968 8bff mov edi,edi
0: kd> ed nt!poolhittag 'cAeS'
0: kd> g
Break instruction exception - code 80000003 (first chance)
nt!DbgBreakPoint:
8052a5cc cc int 3
0: kd> k
ChildEBP RetAddr
f9dc3300 80535bac nt!DbgBreakPoint
f9dc3334 8054b16b nt!ExpInsertPoolTracker+0x1c
f9dc339c 805e3e14 nt!ExAllocatePoolWithTag+0x803
f9dc33c4 805e4223 nt!RtlpInheritAcl+0x28
f9dc34a0 805ef63b nt!RtlpNewSecurityObject+0x387
f9dc34cc 805be847 nt!SeAssignSecurity+0x4f
As indicated by the call stack, this trick only works if pool tagging is enabled on the target system.
Manually Recreating a Stack
This one is always a fun one. For one reason or another the stack is hosed. The biggest culprit is generally a module on the stack that uses FPO for which you don't have symbols. Here, I shows a recent example I ran into.
bd74e438 8045163d nt!IopParseDevice+0xa04
bd74e4ac 804a4e8a nt!ObpLookupObjectName+0x4d5
bd74e5bc 80496b85 nt!ObOpenObjectByName+0xc5
bd74e690 be7728c6 nt!IoCreateFile+0x3ec
bd74e6e0 be77382b savrt+0x458c6
bd74e6ec be7752a5 savrt+0x4682b
bd74e768 be744773 savrt+0x482a5
bd74e77c be742e4b savrt+0x17773
bd74e7a4 be745e75 savrt+0x15e4b
00000000 00000000 savrt+0x18e75
Notice how the stack disappears into that module. This is a clear sign that there is more stack to be seen but WinDBG is no longer able to reconstruct it properly. Generally, getting a call stack for the missing piece can be a real pain. However, there is a slacker approach to get "close enough" with minimal effort. All it requires is some ability to recognize EBP frames on the stack.
For some detailed into on EBP frames, check out Stacking the Deck (The NT Insider, Volume 9, Issue 6, November-December 2002). However, the main point you'll need to know about EBP frames is that EBP points to the previous EBP on the stack and EBP+4 points to the return address.
Our trick will involve inspecting the raw contents of the stack after the corrupt point looking for a pointer to a stack address followed by something that looks like an instruction address. If we follow the pointer to the stack address we should see another stack address followed by another instruction address, and so on.
kd> dps bd74e7a4
bd74e7a4 00000000
bd74e7a8 be745e75 savrt+0x18e75
bd74e7ac e5d590b0
bd74e7b0 e369d7a8
bd74e7b4 be74355a savrt+0x1655a
bd74e7b8 bd74e7c8
bd74e7bc e369d7a8
bd74e7c0 e369d7a8
bd74e7c4 be741b0f savrt+0x14b0f
bd74e7c8 00000000
bd74e7cc e5da1e68
bd74e7d0 e5da1e70
bd74e7d4 be741c26 savrt+0x14c26
bd74e7d8 00000000
bd74e7dc e5da1e70
bd74e7e0 e1bffb48
bd74e7e4 be8329bc SYMEVENT+0xd9bc
bd74e7e8 00000000
bd74e7ec e5da1e70
bd74e7f0 bd74e810
bd74e7f4 be834e74 SYMEVENT+0xfe74
bd74e7f8 e5da1e70
bd74e7fc e1374c28
bd74e800 e5da1e68
bd74e804 bd74e870
bd74e808 e1374c30
bd74e80c e1374c28
bd74e810 bd74e824
bd74e814 be835023 SYMEVENT+0x10023
bd74e818 e5da1e68
bd74e81c bd74e870
bd74e820 e1374c48
bd74e824 bd74e838
bd74e828 be82b07f SYMEVENT+0x607f
bd74e82c bd74e870
Do you notice the pattern of the numbers in underline? If you keep following the pointer values you keep finding stack values followed by addresses somewhere inside the SYMEVENT module. This is exactly what an EBP frame looks like on the stack. If you can find this pattern you can get pretty close to finding the real stack with a quick trick.
Take the second frame from the top that you found and feed the possible EBP address, that address minus eight, and the instruction address found in the first frame from the top to the K command.
kd> k = bd74e810 bd74e810-8 be834e74
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
bd74e810 be835023 SYMEVENT+0xfe74
bd74e824 be82b07f SYMEVENT+0x10023
bd74e838 be833d28 SYMEVENT+0x607f
bd74e854 be82c7b9 SYMEVENT+0xed28
bd74e894 bfd7fd2a SYMEVENT+0x77b9
bd74e8fc 8041fb8b <EDIT - removed module name>+0xcd2a
bd74e910 8049c945 nt!IopfCallDriver+0x35
bd74ea98 8045163d nt!IopParseDevice+0xa04
bd74eb0c 804a4e8a nt!ObpLookupObjectName+0x4d5
bd74ec1c 80496b85 nt!ObOpenObjectByName+0xc5
bd74ecf0 80497f27 nt!IoCreateFile+0x3ec
bd74ed30 80465691 nt!NtCreateFile+0x2e
bd74ed30 77f8f9c5 nt!KiSystemService+0xc4
0006e808 00000000 ntdll!NtCreateFile+0xb
Again, this is a shortcut to getting a stack back and isn't the entirely correct way to calculate the values. However, it has personally not failed me once in getting close enough to analyze the problem with minimal effort.
That's All For Now
Obviously this is just a start and there are many, many more techniques to be covered. Hopefully in future issues we can go over more of our techniques and even some of yours!